Abstract: A common practice in microarray analysis is
to apply a logarithmic transformation to the raw data (light
intensity), and the justification for this transformation is
to make the distribution more symmetric and Gaussian-like.
Since this transformation is not universally practiced in
microarray analysis, we examined whether the presence or
absence of this treatment of the raw data affects the
"high-level" analysis result, in particular whether the
differentially expressed genes obtained by t-test, regularized
t-test, or logistic regression have altered rank orders. We
show that as many as 20%-40% of significant genes are
"discordant" (significant in only one form of the data and not
in both), depending on the test being used and the threshold
value for claiming significance. The t-test is more likely to
be affected by logarithmic transformation than logistic
regression, and the regularized t-test is more affected than
the t-test. On the other hand, the very top-ranking genes
(e.g., up to the top 20-50 genes, depending on the test) are
not affected by the logarithmic transformation.

I. INTRODUCTION 

The number of copies of single-stranded messenger RNA
(mRNA) can be used to infer the amount of protein product
produced by a certain gene, and is called the "expression
level". Ideally, one would like to count the number of copies
of a certain mRNA directly. But in microarray chips, the
amount of a specific mRNA is measured indirectly by the
emission of fluorescence light. It is necessary to transform
the raw data of light intensity obtained by optical detection
into a summarized quantity that indicates the expression level.
Deriving the expression level from raw data is called the
"low-level" analysis, and it can be complicated by the details
of the technology and chip platform [1], [2]. Reaching
conclusions, such as the determination of differentially
expressed genes, using the expression level data is called the
"high-level" analysis.

After the expression level is derived from the raw
data, another preprocessing step is commonly practiced:
log-transformation. The standard motivation for the
log-transformation is that the distribution of the derived
expression level is typically asymmetric, with a long tail at
the high-expression end. Many parametric statistical tests
require variables to follow a Gaussian/normal distribution.

W. Li is a Research Scientist with the Robert S Boas Center for
Genomics and Human Genetics, Feinstein Institute for Medical
Research, North Shore LIJ Health System, Manhasset, NY 11030, USA
wli@nslij-genetics.org

Y.J. Suh is a Research Professor of The Research Institute of
Natural Sciences, Sookmyung Women's University, Seoul 140-742, Korea.
yjsprite@yahoo.co.kr

J. Zhang is a Senior Statistician at Forest Research Institute, Jersey City,
NJ 07311, USA jingshan.zhang@frx.com



The log-transformation is an attempt to convert an
asymmetric distribution to a symmetric and Gaussian-like one.
Other transformations for the purpose of "normality" are
also possible [3], such as square-root, Box-Cox [4], and
arcsine transformations. For microarray data, transformations
have been proposed along the lines of variance stabilization
[5], [6].

A novel alternative explanation of the use of
log-transformation might be that humans perceive the brightness
of light as the logarithm of light energy, similar to our
perceiving the loudness of sound as the logarithm of sound
intensity. In general, human perception of physical stimuli
is approximately proportional to the logarithm of the amount of
stimulus, under the names of the Weber-Fechner law [7], [8] and
Stevens' law [9]. For the light-intensity-derived expression
level, log-transformation can be viewed as a way to measure the
"perception signal" from the data.

From the statistical point of view, logarithm transformation
can pull in an outlier with an extremely high value, thus
affecting the group mean. On the other hand, logarithm
transformation, or any monotonically increasing one-to-one
transformation, will not shuffle the relative order of
expression values, and thus will not affect the result of a
rank-based test such as the Wilcoxon-Mann-Whitney test
[10]. For a specific test or statistical model, the effect of
log-transformation on the result is not clear, even though
we know it has no effect if the test is rank-based, and has
some effect if there are outliers. For linear classifiers, the
violation of the Gaussian assumption affects some methods more
(e.g., Fisher's linear discriminant analysis, perceptron) and
other methods less (e.g., logistic regression, support
vector machine) [11].
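The contrast between rank-based and mean-based statistics can be illustrated with a minimal sketch. The expression values below are hypothetical (chosen to include one extreme outlier), and the helper names `mann_whitney_u` and `welch_t` are our own, not taken from any package used in this study.

```python
import math

def mann_whitney_u(group1, group2):
    """U statistic for group1: number of pairs (x, y) with x > y; ties count 0.5."""
    u = 0.0
    for x in group1:
        for y in group2:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

def welch_t(group1, group2):
    """Difference of group means divided by its unpooled standard error."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical expression levels; 3000.0 plays the role of a high outlier.
patients = [120.0, 150.0, 180.0, 200.0, 260.0, 3000.0]
controls = [90.0, 100.0, 130.0, 140.0, 160.0, 170.0]
log_patients = [math.log(x) for x in patients]
log_controls = [math.log(x) for x in controls]

u_raw = mann_whitney_u(patients, controls)
u_log = mann_whitney_u(log_patients, log_controls)
t_raw = welch_t(patients, controls)
t_log = welch_t(log_patients, log_controls)

print("U (raw, log):", u_raw, u_log)  # identical: the ordering is preserved
print("t (raw, log):", t_raw, t_log)  # differ: the outlier dominates the raw mean and variance
```

Because the logarithm preserves ordering, the rank-based statistic is the same for both forms of the data, whereas the t-statistic changes once the outlier is compressed.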

Another note on investigating the effect of
log-transformation is that one can focus either on the whole
list of genes, or only on the more interesting top-ranking
genes. For example, with a log-transformation, the top 1 and
2 differentially expressed genes may be switched while the
ranks of all other genes are unchanged. Even though the effect
of log-transformation on the whole list of genes could be
small, a minor rearrangement of the top-ranking genes can
be crucial in designing subsequent experiments such as
gene validation by real-time PCR.

We will examine the effect of log-transformation on two
or three simple methods for selecting differentially expressed
genes on a real microarray dataset. Log-transformation is just
one factor that changes the apparent value of the data; there
are other factors as well, such as the normalization procedure
during the "low-level" analysis, a change of the probe set
design, a change of the microarray platform, etc.
Fig. 1. Minus log of p-values of tests on log-transformed vs. original data.
The x axis is $-\log_{10}$(p-value) for the original expression data, and the y axis
is $-\log_{10}$(p-value) for the log-transformed data. The top plot is for logistic
regression and the bottom plot for t-test. The four quadrants as split by x = 5
and y = 5 are indicated. Each point represents a gene.

II. METHODS AND DATA 

A. Student's t-test 

The Student's t-test is used here as a representative of tests
that make assumptions about variable normality. We expect the
normality requirement to be met better by the log-transformed
data than by the original data. The t-statistic is defined as
the ratio of the difference of the two group means to the
standard error of this difference:
$t = (E_1 - E_2)/\sqrt{s_1^2/n_1 + s_2^2/n_2}$,
where $E_{1,2}$, $s^2_{1,2}$, and $n_{1,2}$ are the mean,
variance, and sample size of groups 1 and 2. The p-value for a
given t-statistic value is determined by the Student's
t-distribution with df degrees of freedom. Usually, df is equal
to $n_1 + n_2 - 2$, but when the variances in the two groups
are not equal, a more complicated formula for df can be used
[12]. We use such a method as implemented in the R statistical
package (http://www.r-project.org/).
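As an illustration of the t-statistic and of an unequal-variance degree of freedom, the following sketch assumes that the "more complicated formula" is the Welch-Satterthwaite approximation, which is what R's `t.test` uses by default. Whether [12] describes exactly this formula is our assumption, and the function name `welch_t_and_df` is hypothetical.

```python
import math

def welch_t_and_df(group1, group2):
    """Return (t, df) for two samples with possibly unequal variances.

    t  = (E1 - E2) / sqrt(s1^2/n1 + s2^2/n2)
    df = Welch-Satterthwaite approximation (assumed here; see lead-in).
    """
    n1, n2 = len(group1), len(group2)
    e1, e2 = sum(group1) / n1, sum(group2) / n2
    s1_sq = sum((x - e1) ** 2 for x in group1) / (n1 - 1)
    s2_sq = sum((x - e2) ** 2 for x in group2) / (n2 - 1)
    a, b = s1_sq / n1, s2_sq / n2  # squared standard errors of the two means
    t = (e1 - e2) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df

# With equal sample sizes and equal variances, df reduces to n1 + n2 - 2.
t_equal, df_equal = welch_t_and_df([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t_equal, df_equal)

# With very unequal variances, df falls below n1 + n2 - 2.
t_uneq, df_uneq = welch_t_and_df([1, 2, 3, 4, 5], [0, 10, 20, 30, 40])
print(t_uneq, df_uneq)
```

Converting (t, df) to a p-value requires the Student's t-distribution function, which is left to a statistics package.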

B. Logistic regression 

Logistic regression is used to represent statistical models
which do not have a strong normality requirement. The
advantage of models or tests lacking such a requirement
is that they are more robust. The disadvantage of models
without the normality requirement is that when the variable
is in fact distributed as Gaussian, they are less "efficient"
as classifiers [13]. The significance of a single-gene logistic
regression can be determined by a likelihood-ratio test: under
the null hypothesis, (-2) times the maximized log-likelihood of
the null model, minus the same quantity for the logistic
regression model, follows a $\chi^2$ distribution with one
degree of freedom. Thus, given the (-2) log-likelihood ratio
(called "deviance"), the p-value can be determined using the
$\chi^2$ distribution.
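The likelihood-ratio test for a single-gene logistic regression can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the fit uses plain Newton-Raphson without step-halving, the predictor is standardized internally for numerical stability (which leaves the likelihood ratio unchanged), and the function name `logistic_lrt` is hypothetical. For one degree of freedom the $\chi^2$ tail probability equals erfc(sqrt(deviance/2)), so no external library is needed.

```python
import math

def logistic_lrt(x, y, max_iter=50, tol=1e-10):
    """Return (deviance, p_value) for y ~ logistic(b0 + b1 * x) vs. the null model.

    x: expression values of one gene; y: 0/1 group labels (both classes present).
    """
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if sd == 0.0:
        return 0.0, 1.0  # a constant predictor carries no information
    z = [(v - mean) / sd for v in x]

    def loglik(b0, b1):
        total = 0.0
        for zi, yi in zip(z, y):
            eta = b0 + b1 * zi
            # y*eta - log(1 + exp(eta)), written in a numerically stable form
            total += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
        return total

    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, yi in zip(z, y):
            eta = b0 + b1 * zi
            if eta >= 0:
                p = 1.0 / (1.0 + math.exp(-eta))
            else:
                p = math.exp(eta) / (1.0 + math.exp(eta))
            w = p * (1.0 - p)
            g0 += yi - p          # gradient with respect to b0
            g1 += (yi - p) * zi   # gradient with respect to b1
            h00 += w
            h01 += w * zi
            h11 += w * zi * zi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break  # (near-)separable data: stop rather than diverge
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break

    n1 = sum(y)
    n0 = n - n1
    ll_null = n1 * math.log(n1 / n) + n0 * math.log(n0 / n)
    deviance = max(2.0 * (loglik(b0, b1) - ll_null), 0.0)
    p_value = math.erfc(math.sqrt(deviance / 2.0))  # chi-square tail, 1 df
    return deviance, p_value

# Hypothetical toy data: overlapping groups with a shift in expression.
labels = [0] * 6 + [1] * 6
expr = [1, 2, 3, 4, 5, 6, 3, 4, 5, 6, 7, 8]
dev, p = logistic_lrt(expr, labels)
print(dev, p)
```

Unlike the Wilcoxon-Mann-Whitney statistic, this deviance is invariant under affine rescaling of the expression values but not under a logarithmic transformation, which is why the test can give different results on raw and log-transformed data.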



TABLE I 

PERCENTAGE OF DISCORDANT GENES: (I+IV)/(I+II+IV) 
C. Regularized t-test and significance analysis of microarrays
(SAM)

Since a low expression level also leads to low variance, the
t-statistic can be high merely because of a low expression
level. Penalized or regularized statistics add an extra term
$s_0$ to prevent this small variance from inflating the
statistic:
$d = (E_1 - E_2)/(\sqrt{s_1^2/n_1 + s_2^2/n_2} + s_0)$.
SAM (significance analysis of microarrays) is a method for
determining the value of $s_0$ [14]. The SAM test statistic,
the d-score, was calculated by the SAM package obtained from
http://www-stat.stanford.edu/~tibs/SAM/.
D. Microarray data 

The illustrative microarray data come from a profiling study of
rheumatoid arthritis. There are 43 patients and 48 normal
controls, which is more than the 29 patients and 21 controls
used in the previous publication [15]. The mRNA was
extracted from peripheral blood mononuclear cells. The
microarray data were obtained from the Affymetrix HG-U133A
GeneChip with 22,283 genes/probe-sets, and were normalized
by the Affymetrix microarray suite (MAS) program.

III. RESULTS 

A. Proportion of discordant differentially expressed genes 

Fig. 1 shows the minus log of p-values for the log-transformed
expression data vs. that for the un-log-transformed (raw)
expression data, for both logistic regression (top) and t-test
(bottom). Taking all genes as a whole, the two sets of p-values
are highly correlated (correlation coefficients are 0.94 and
0.93, respectively). In order to highlight the differences,
especially for the high-ranking differentially expressed genes,
we split the plot into four quadrants by a vertical line at
$x = a$ and a horizontal line at $y = a$. The parameter
$a = -\log_{10}(p_0)$ corresponds to the gene selection
threshold $p_0$ for p-values. For example, $a = 5$ in Fig. 1
corresponds to a p-value threshold of $p_0 = 0.00001$.

The genes in quadrants I, II, and IV have at least one of the
two p-values (log and raw data) smaller than $p_0$, whereas
the genes in quadrant II have both p-values smaller than $p_0$.
If log-transformation has no effect on the gene selection, there
will be no points in quadrants I and IV. We use the percentage
of points in I and IV out of all points in I, II, and IV as a
measure of the inconsistency between the test results on raw and
log-transformed data. If points in quadrants I and IV are
Fig. 2. Rank difference d as a function of averaged rank $R_a$ for all 22283
genes (A,B,C) and for the top-400 genes (D,E,F). Both rank difference d and
averaged rank $R_a$ concern the same gene on two different types of data (raw
and log-transformed). (A) and (D) are results for logistic regression, (B) and
(E) are for t-test, (C) and (F) for SAM. The x-axis in (D,E,F) is in log scale
to highlight the top-ranking genes. In (D,E,F), the lines d = 50, -50, 100, -100
and $d = R_a$, $d = -R_a$ are drawn.

called "discordant" and those in quadrant II "concordant",
this measure is the percentage of discordant genes among
all genes declared differentially expressed in at least one
type of data.
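The discordance measure can be written as a short function. The sketch below counts genes that are significant only on the raw data, only on the log-transformed data, or on both, without committing to which of the two single-significance groups is labeled quadrant I or IV. The function name `discordant_fraction` and the toy p-values are hypothetical.

```python
def discordant_fraction(p_raw, p_log, p0):
    """Fraction (I + IV) / (I + II + IV) at p-value threshold p0.

    p_raw[i], p_log[i]: p-values of gene i on raw and log-transformed data.
    Returns None when no gene is significant in either form of the data.
    """
    only_raw = only_log = both = 0
    for pr, pl in zip(p_raw, p_log):
        sig_raw, sig_log = pr < p0, pl < p0
        if sig_raw and sig_log:
            both += 1        # concordant (quadrant II)
        elif sig_raw:
            only_raw += 1    # discordant
        elif sig_log:
            only_log += 1    # discordant
    total = only_raw + only_log + both
    if total == 0:
        return None
    return (only_raw + only_log) / total

# Hypothetical p-values for four genes.
toy_raw = [1e-6, 1e-3, 1e-7, 0.5]
toy_log = [1e-7, 1e-6, 1e-2, 0.4]
frac = discordant_fraction(toy_raw, toy_log, p0=1e-5)
print(frac)  # one concordant and two discordant genes
```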

Table I shows the discordant percentages and their 95%
confidence intervals (CI) at various gene selection thresholds
$p_0$ (= $10^{-9}$, $10^{-4}$, 0.001, 0.01). As expected, the
t-test result is more affected by the log transformation than
logistic regression: at all $p_0$ threshold values, the
percentage of discordant differentially expressed genes is
higher in t-test than in logistic regression. The average
discordant percentage at eight $p_0$ values is 27% for logistic
regression and 44% for t-test.

It was, however, surprising that for logistic regression,
except for the extremely differentially expressed genes (e.g.,
when p-value < $10^{-9}$, the discordant percentage is zero),
the discordant percentage is not negligible. If either the
raw or the log-transformed data are used for logistic regression
analysis, as many as 10%-20% of the claimed differentially
expressed genes will not be claimed so with the other form of
the data.

B. Ranking change due to log transformation 

The effect of log-transformation can also be examined by
the ranking of a gene in both datasets. If log-transformation
has no effect, the rank of a gene by (e.g.) p-value will be
unchanged. We use the notation $R_n(i)$, $R_l(i)$ for the rank
of gene i in the raw and log-transformed data, and define
$R_a(i)$ as the average of the two:
$R_a(i) = (R_n(i) + R_l(i))/2$, and $d(i)$ as the rank
difference: $d(i) = R_n(i) - R_l(i)$. Fig. 2(A,B,C) shows d vs.
$R_a$ for logistic regression, t-test, and SAM (genes are
ranked by absolute value of the d-score) for all 22283 genes.
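The averaged rank and the rank difference can be computed as in the following sketch. Ties are broken by the original gene order, which is an assumption on our part, and the function names and toy p-values are hypothetical.

```python
def ranks_by_pvalue(pvalues):
    """Rank 1 = smallest p-value; ties broken by original order (an assumption)."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    ranks = [0] * len(pvalues)
    for rank, gene in enumerate(order, start=1):
        ranks[gene] = rank
    return ranks

def rank_change(p_raw, p_log):
    """Return (R_a, d): averaged rank and rank difference (raw minus log) per gene."""
    r_n = ranks_by_pvalue(p_raw)  # ranks on raw data
    r_l = ranks_by_pvalue(p_log)  # ranks on log-transformed data
    r_a = [(a + b) / 2 for a, b in zip(r_n, r_l)]
    d = [a - b for a, b in zip(r_n, r_l)]
    return r_a, d

# Hypothetical p-values: the top two genes swap places, the others keep their rank.
r_a, d = rank_change([0.01, 0.001, 0.5, 0.2], [0.001, 0.01, 0.5, 0.3])
print(r_a)
print(d)
```

For ranking by the SAM d-score, the same functions could be applied to the negated absolute d-scores in place of p-values, so that the largest absolute score receives rank 1.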



Fig. 2(A,B,C) indicates that for the whole gene set there is
a similar pattern for all three test statistics: the highest-
and lowest-ranking genes are ranked high and low, respectively,
in both raw and log-transformed data (thus smaller rank
differences). As the majority of genes are not differentially
expressed, the overall scattering pattern in Fig. 2(A,B,C) may
not be as interesting as the behavior near the high-ranking
differentially expressed genes.

To focus on the top-ranking genes, Fig. 2(D,E,F) zooms in
on the top-400 genes (x-axis is in log scale). First, we notice
that for the very top genes (e.g., up to top-10), the ranking is
unchanged or changed very little by the log transformation in
any one of the three tests/models. Second, the t-test reaches
rank differences of d = 50 and d = 100 sooner (i.e., at a
higher ranking) than logistic regression, reconfirming our
previous conclusion that the t-test is more likely to be
affected by log transformation than logistic regression. Using
the $d = R_a$ and $d = -R_a$ envelope, we see that points are
more likely to be outside the envelope for the t-test than for
logistic regression. The third observation is that the SAM test
result is affected even more by log transformation than the
t-test. In Fig. 2(F), many points are far outside the envelope
region.

IV. CONCLUSIONS AND FUTURE WORKS 

A. Conclusions 

Using one microarray dataset, we have shown that log
transformation may affect results on selecting differentially
expressed genes. If we call all genes that are significant by
tests on either raw or log-transformed data "differentially
expressed genes", and those genes that are significant in the
test on only one of the two types of data "discordant",
the discordant genes as a proportion of all (discordant and
concordant) differentially expressed genes average 27% for
logistic regression and 44% for t-test over the thresholds
examined. The larger discordant percentage for t-test confirms
our general understanding that tests that require variable
normality are more likely to be affected by variable
transformation.

B. Future Works 

We plan to extend the results here to other public domain 
microarray datasets and to other tests, models, and measures 
for determining differentially expressed genes. 